Skip to content

fix(upload_file): unstick concurrent uploads stalling at 30s#803

Open
jrvb-rl wants to merge 1 commit into
mainfrom
fix/upload-file-408-stalls
Open

fix(upload_file): unstick concurrent uploads stalling at 30s#803
jrvb-rl wants to merge 1 commit into
mainfrom
fix/upload-file-408-stalls

Conversation

@jrvb-rl
Copy link
Copy Markdown
Contributor

@jrvb-rl jrvb-rl commented May 14, 2026

Summary

Customers running many concurrent upload_file calls through a single SDK instance were hitting bursts of 408 Request timed out waiting for request data from mux/Jetty. Triage on 2026-05-14 traced this to three things, all introduced by recent SDK changes:

  • feat: share HTTP connection pool across SDK instances; refactor polling #797 lowered the default httpx pool from (100, 20) → (20, 10) and made it process-global. With many parallel multipart bodies sharing 20 TCP connections instead of 100, per-stream bandwidth dropped to where the server's 30 s request-body idle timeout fires before the body finishes arriving.
  • _should_retry blanket-retries 408, so a stalled upload immediately queues another large multipart body onto the same exhausted pool — amplifying the stall instead of recovering from it.
  • _files.py read the entire PathLike into memory via read_bytes() before the request was even built, contradicting the docstring's claim of streaming. Holds the pool slot during disk I/O for large files.

Changes

File Change
_constants.py DEFAULT_CONNECTION_LIMITS restored to (max_connections=100, max_keepalive_connections=20).
_base_client.py _should_retry returns False for 408s on /upload_file. Other 408s, 429s, 5xxs still retry.
_base_client.py New connection_limits: httpx.Limits | None = None on SyncAPIClient / AsyncAPIClient. Passing it gives the client a private pool with the requested limits (implicitly opts out of the shared pool — that's process-global and can't honor per-client limits).
_client.py Exposes connection_limits on Runloop / AsyncRunloop constructors and copy() / with_options().
_files.py _transform_file / _async_transform_file return path.open(\"rb\") (an io.IOBase) instead of read_bytes(). httpx streams from the handle.
tests/ Updated test_files.py to expect streamed handles, added 2 new _should_retry tests, added TestSyncConnectionLimits / TestAsyncConnectionLimits covering the new option.
examples/stress_upload_file.py New standalone repro that fires N concurrent upload_file calls and reports per-status counts and p50/p95/p99 latency. Knobs: --concurrency, --total, --file-size-kb, --max-retries, --max-connections, --max-keepalive.

Usage of the new option

import httpx
from runloop_api_client import AsyncRunloop

# Raise the cap for upload-heavy workloads:
client = AsyncRunloop(
    connection_limits=httpx.Limits(max_connections=300, max_keepalive_connections=50)
)

Test plan

  • uv run pytest tests/ — 1079 passed locally
  • New regression tests: test_upload_file_408_does_not_retry, test_non_upload_file_408_still_retries, TestSyncConnectionLimits, TestAsyncConnectionLimits
  • Manual repro on examples/stress_upload_file.py against prod with --concurrency 50 --total 200:
    • Pin to v1.20.3, expect a burst of 408s at ~30 s each
    • Switch to this branch, expect 0 408s and lower wall time
  • Optional: customer pre-release try-out, watch their 408 rate in Honeycomb mux

Notes for reviewers

  • Three independent fixes are bundled because they share a single customer-visible symptom and a single repro path — separating would force whoever is bisecting in production to back-port them individually. Easy to split if preferred.
  • _constants.py is Stainless-generated; the (100, 20) value just restores the pre-feat: share HTTP connection pool across SDK instances; refactor polling #797 default and we should probably move this off the generated file longer-term so it stops being a release-time merge conflict candidate.
  • HTTP/2 is left enabled (from Improve devbox polling: eliminate client-side sleep, enable HTTP/2 #793). It may still hurt multipart throughput under contention; happy to add http2: bool = True as a sibling knob if that's a follow-up we want.

🤖 Generated with Claude Code

…erver

Customers running many concurrent upload_file calls through a single SDK
instance were hitting 408 "Request timed out waiting for request data"
from mux/Jetty. Triage on 2026-05-14 found the request-body simply wasn't
arriving within Jetty's 30 s data-wait window. Three contributing causes,
all introduced by recent SDK changes:

1. DEFAULT_CONNECTION_LIMITS was lowered from (100, 20) to (20, 10) in
   #797. Combined with the new shared process-global pool, parallel
   uploads divide bandwidth across far fewer TCP connections than before
   and starve each other on multipart bodies. Restore the prior (100, 20).

2. _should_retry blanket-retries 408, including upload_file. Each retry
   pushes another large multipart body onto the same exhausted pool and
   amplifies the stall instead of recovering from it. Carve out
   upload_file 408 — other endpoints still retry 408 as before.

3. _files.py read the entire PathLike into memory via read_bytes() before
   the request was built, contradicting the docstring claim of streaming.
   Hand httpx an open file handle so the multipart encoder reads lazily
   and the pool slot isn't held during disk I/O. Tests updated to assert
   IsInstance(io.IOBase) instead of IsBytes() and to close handles.

Also exposes a public connection_limits knob on Runloop / AsyncRunloop
(and copy()) so customers can raise the cap above 100 for upload-heavy
workloads without needing a custom httpx client. Passing
connection_limits implicitly opts out of the shared pool, since the
shared pool is process-global and can't honor per-client limits.

Includes examples/stress_upload_file.py — a standalone repro that fires
N concurrent upload_file calls and reports status-code distribution and
latency percentiles, so the before/after delta is one command away.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant